Distributional Similarity, Phase Transitions and Hierarchical Clustering
Abstract
We describe a method for automatically clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy is used to measure the dissimilarity of those distributions. Clusters are represented by "typical" context distributions averaged from the given words according to their probabilities of cluster membership, and in many cases can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest distortion sets of clusters. As the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data.
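The update rules implied by the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the exponential membership form, the fixed-point loop at a single value of the annealing parameter (`beta`), and the toy word distributions are all assumptions made for the example. Memberships are taken proportional to exp(-beta × relative entropy), and each centroid is the membership-weighted average of the word distributions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) in nats; eps guards against division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def memberships(word_dists, centroids, beta):
    """Soft assignment p(c | w), assumed proportional to exp(-beta * D(p_w || centroid_c))."""
    result = []
    for p in word_dists:
        d = [kl_divergence(p, c) for c in centroids]
        d_min = min(d)  # subtract the minimum for numerical stability
        weights = [math.exp(-beta * (di - d_min)) for di in d]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result

def update_centroids(word_dists, member):
    """Each centroid is the membership-weighted average of the word distributions."""
    n_clusters = len(member[0])
    n_contexts = len(word_dists[0])
    centroids = []
    for c in range(n_clusters):
        mass = sum(m[c] for m in member)
        centroids.append([
            sum(m[c] * p[k] for m, p in zip(member, word_dists)) / mass
            for k in range(n_contexts)
        ])
    return centroids

def soft_cluster(word_dists, centroids, beta, iterations=50):
    """Alternate membership and centroid updates at a fixed beta."""
    for _ in range(iterations):
        member = memberships(word_dists, centroids, beta)
        centroids = update_centroids(word_dists, member)
    return memberships(word_dists, centroids, beta), centroids

# Hypothetical toy data: four "words", each a distribution over three contexts.
words = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
# Two centroids started as small perturbations of the overall mean distribution.
init = [[0.45, 0.15, 0.40], [0.40, 0.15, 0.45]]

low_member, low_centroids = soft_cluster(words, init, beta=0.1)
high_member, high_centroids = soft_cluster(words, init, beta=20.0)
```

On this toy data, the two centroids collapse onto the same distribution at the small `beta`, so every word belongs about equally to both clusters. At the large `beta` the centroids separate and the first two words fall into one cluster, the last two into the other. A full annealing run would raise `beta` gradually and add centroids as the existing ones become unstable; that schedule is not shown here.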
Related resources
‘Over reference’: a comparative study on German prefix-verbs
• Experiment: Hierarchical clustering of 4 × 10 prefix-verbs on über (over). We extracted vector representations for all items in our dataset (derived and simple verbs) by relying on a state-of-the-art technique (cf. Mikolov et al. [2013] continuous bag-of-words representation). The distributional semantic model on which our experiment was conducted was extracted from the SdeWac corpus (cf. Faaß...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
The accuracy and validity of data are prerequisites for the proper operation of any software system. Errors in data are always possible because of human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. Only one of them should exist in a data source, but for some reasons like aggregation of ...
A New Co-similarity Measure: Application to Text Mining and Bioinformatics. (Une Nouvelle Mesure de Co-Similarité : Applications aux Données Textuelles et Génomique)
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and there exist a multitude of different clustering algorithms for different settings. As datasets become larger and more varied, adaptations of existing algorithms are required to maintain the quality of clus...
Research Interests João Sedoc Description of Work
Presently my main research interest is the development and application of machine learning and statistical techniques toward natural language processing. The representation of words using vector space models is widely used for a variety of natural language processing (NLP) tasks. The two main word embedding categories are cluster-based and dense representations. Brown Clustering and other hiera...